Project Summary

This project analyzes customer churn data to identify the key factors influencing whether a customer leaves a telecom service. Using skills in data wrangling, exploratory data analysis (EDA), and statistical correlation, the project uncovers that features like contract type, tenure, and payment method are highly associated with customer retention. Visualization, feature interpretation, and data storytelling are applied to translate patterns into actionable insights for improving customer retention strategies.

The interactive dashboard below allows you to explore the data, visualize patterns, and even simulate customer scenarios to predict churn probability based on various attributes.

Overview & Background

A business measures customer churn as the number of existing customers who stop doing business with it or stop using its service, relative to the total number of customers in a given period of time. Analyzing customer churn matters because it helps a business understand why customers stop using its service or stop doing business with it. Improving customer retention builds brand loyalty and increases overall customer satisfaction and profitability. While the churn rate itself is easy to calculate, churn is difficult to predict accurately.
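
As a simple illustration of how the churn rate itself is calculated (the customer counts below are made up for the example and are not taken from the dataset):

# Hypothetical example: churn rate over one quarter
customers_at_start = 1200   # customers at the beginning of the period
customers_lost = 90         # customers who cancelled during the period
churn_rate = customers_lost / customers_at_start
print(f"Quarterly churn rate: {churn_rate:.1%}")   # prints 7.5%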

The dataset I will be using comes from a telecommunications company that provides home phone and internet services to 7,043 customers in California.

The data set includes information about:

  • Whether a customer left the company (the Churn column)
  • Services each customer has signed up for (phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies)
  • Customer account information (tenure, contract, payment method, paperless billing, monthly charges, and total charges)
  • Demographic information (gender, senior citizen status, partners, and dependents)

In this project I analyze the different factors that affect customer churn by creating regression models to identify correlations and by building a survival analysis model. I also build a classification machine learning model to accurately predict the likelihood that a customer will churn.

Objectives

The data comes from Kaggle and can be accessed here

Python Code for Libraries Imported
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from lifelines.plotting import plot_lifetimes
from lifelines import KaplanMeierFitter
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
import statsmodels.api as sm
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
from xgboost import XGBClassifier
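
Before the analysis below, the Kaggle CSV has to be loaded into a dataframe. A minimal loading sketch, assuming the file is saved locally as Telco-Customer-Churn.csv (the file name and the TotalCharges cleanup are assumptions about this step, which is not shown in the original code):

# Load the Telco churn data (assumed local file name)
df = pd.read_csv('Telco-Customer-Churn.csv')

# In this dataset TotalCharges is read as text; coerce it to numeric and
# drop the few rows where it is blank (an assumed cleanup step)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df = df.dropna(subset=['TotalCharges'])

print(df.shape)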
            

Summary Statistics

Python Code for Summary Statistics
#Frequency Table for contract type
contracttype_churn_counts = df.groupby(['Churn', 'Contract']).size().unstack(fill_value=0)
print(contracttype_churn_counts)
print("\n")

# Normalized frequency table for the contract type
contract_table_percent = contracttype_churn_counts.div(
	contracttype_churn_counts.sum(axis=1), axis=0) * 100
print(contract_table_percent)

# Frequency for each type of payment method
payment_stats = df['PaymentMethod'].value_counts(normalize=True) * 100
print(payment_stats)
payment_dist = df.groupby('Churn')['PaymentMethod'].value_counts().unstack()
print(payment_dist)
print('\n')

# Percentage for Payment Method in Customer Churn overall
payment_churn_percentage = df.groupby('Churn')['PaymentMethod'].value_counts(
	normalize=True).mul(100).unstack()
print(payment_churn_percentage)

#Frequency tables for each telecommunication service
df = df.rename(columns={
	'PhoneService': 'Phone Service',
	'MultipleLines': 'Multiple Lines',
	'InternetService': 'Internet Service',
	'OnlineSecurity': 'Online Security',
	'OnlineBackup': 'Online Backup',
	'DeviceProtection': 'Device Protection',
	'TechSupport': 'Tech Support',
	'StreamingTV': 'Streaming TV',
	'StreamingMovies': 'Streaming Movies'})
service = ['Phone Service', 'Multiple Lines', 'Internet Service', 'Online Security',
	'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV',
	'Streaming Movies']
def generate_service_frequency_by_churn(df, service):
	for col in service:
		print(f"\nFrequency Table for '{col}' (Grouped by Churn):")
		print(df.groupby('Churn')[col].value_counts()) # Raw counts
		print("\nPercentage Distribution by Churn:")
		print(df.groupby('Churn')[col].value_counts(normalize=True).mul(100).round(2))
		print("-" * 60)
generate_service_frequency_by_churn(df, service)

# Create frequency table for each demographic
df = df.rename(columns = {
	'gender' : 'Gender',
	'SeniorCitizen' : 'Senior Citizen'
})
df['Senior Citizen'] = df['Senior Citizen'].replace({0: 'No', 1: 'Yes'})
demographics = ['Gender', 'Senior Citizen', 'Partner', 'Dependents']
def generate_demographic_frequency_by_churn(df, demographics):
	for col in demographics:
		print(f"\nFrequency Table for '{col}' (Grouped by Churn):")
		print(df.groupby('Churn')[col].value_counts())
		print("\nPercentage Distribution by Churn:")
		print(df.groupby('Churn')[col].value_counts(normalize=True).mul(100).round(2))
		print("-" * 60)
generate_demographic_frequency_by_churn(df, demographics)
            

Churn Analysis Visualization

Visualizations for Charges & Tenure

Python Code for Churn Analysis Visualization
# Data visualization of churn/no churn based on monthly charges
sns.kdeplot(df.MonthlyCharges[df["Churn"] == 'No'], fill = True, label="No Churn")
sns.kdeplot(df.MonthlyCharges[df["Churn"] == 'Yes'], fill = True, label="Churn")
plt.title('Monthly Charges by Churn (KDE PLOT)')
plt.xlabel('Monthly Charges')
plt.ylabel('Density')
plt.show()

# Data visualization of churn/no churn based on total charges (kde plot)
sns.kdeplot(df.TotalCharges[df["Churn"] == 'No'], fill = True, label="No Churn")
sns.kdeplot(df.TotalCharges[df["Churn"] == 'Yes'], fill = True, label="Churn")
plt.title('Total Charges by Churn (KDE PLOT)')
plt.xlabel('Total Charges')
plt.ylabel('Density')
plt.show()

# Data visualization for tenure by churn
sns.violinplot(data = df, x = 'Churn', y = 'tenure')
plt.title('Tenure by Churn')
plt.xlabel('Churn')
plt.ylabel('Tenure')
plt.show()
            

Monthly Charges by Churn (KDE Plot)


Customers who churned had higher monthly charges on average. You can see the orange curve peaking around 70–100 USD. Customers who did not churn have two notable clusters: one peak at low monthly charges (around 20 USD) and another smaller one around 70–90 USD. This smooth curve helps you see general distribution trends and compare how spread out or concentrated the values are.

Total Charges by Churn (KDE Plot)


Non-churners have spent much more over time, which makes sense since they have stayed longer. The large gap between the median total charges for churners (703 USD) and non-churners (1,679 USD) shows that churners often leave before spending much. This view is similar to monthly charges but also reflects tenure. Churned customers (orange) are clustered at low total charges, typically under 2,000 USD, with a sharp peak very early, while non-churned customers (blue) are more widely distributed, with a long tail reaching 8,000–9,000 USD, suggesting they have been with the company longer.
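
The medians quoted above can be checked directly against the dataframe; a quick sketch using the same df as in the earlier code:

# Median charges and tenure for churned vs. retained customers
print(df.groupby('Churn')['TotalCharges'].median())
print(df.groupby('Churn')[['MonthlyCharges', 'tenure']].median())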

Tenure by Churn (Violin Plot)


The distribution for non-churned customers is fairly even and close to normal, while churned customers are heavily concentrated around 0–10 months of tenure, giving their distribution a strong right skew. This shows that customers who churn tend to leave after only a short time with the service. Although churners pay higher monthly charges on average, their total charges are lower because they leave so early, which suggests the company struggles to retain customers at the beginning of the relationship. There is also a clear core of loyal customers: the tenure of non-churned customers is nearly double that of churners, and their total spend is correspondingly higher.

Visualizations for Other Variables

Python Code for Churn Analysis Visualization
# Plot the bar chart for contract type by churn
contracttype_churn_counts.plot(kind='bar', figsize=(8, 6))
plt.title('Contract Type by Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(['Month-to-Month', 'One Year', 'Two Year'], loc='upper right')
plt.show()

# Plot the bar chart for payment method by churn
payment_dist.plot(kind='bar', figsize=(8, 6))
plt.title('Payment Method by Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(['Bank transfer (automatic)', 'Credit card', 'Electronic check',
	'Mailed check'], loc='upper right')
plt.show()

# Data visualization for different telecommunication services by Churn
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
axes = axes.flatten() 
for i, col in enumerate(service):
	sns.countplot(data=df, x='Churn', hue=col, ax=axes[i])
	axes[i].set_title(f"{col} by Churn")
	axes[i].set_xlabel("Churn")
	axes[i].set_ylabel("Count")
	axes[i].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()

# Data visualization for different demographics by Churn
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12)) # 2x2 grid
axes = axes.flatten()
for i, col in enumerate(demographics):
	sns.countplot(data=df, x='Churn', hue=col, ax=axes[i])
	axes[i].set_title(f"{col} by Churn")
	axes[i].set_xlabel("Churn")
	axes[i].set_ylabel("Count")
	axes[i].tick_params(axis='x', rotation=0)
plt.tight_layout()
plt.show()

            

Key Metrics

  • Overall Churn Rate: 26.5% of customers churned overall
  • Avg. Customer Tenure: 18 months (average customer lifespan)
  • Avg. Monthly Cost: $75 among customers likely to churn
  • Avg. Total Charges: $1.5K among customers likely to churn
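
A sketch of how these headline figures could be derived from the dataframe (reading "customers likely to churn" as those with Churn == 'Yes' is an assumption about how the dashboard defines the segment):

# Overall churn rate
churn_rate = (df['Churn'] == 'Yes').mean() * 100
print(f"Overall churn rate: {churn_rate:.1f}%")

# Average tenure, monthly charges and total charges, split by churn status
print(df.groupby('Churn')[['tenure', 'MonthlyCharges', 'TotalCharges']].mean().round(1))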

Churn Survival Analysis

Python Code for Churn Survival Analysis based on Variable
# Survival Analysis using Telco Data
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})
durations = df['tenure']
event_observed = df['Churn']
km = KaplanMeierFitter()
km.fit(durations, event_observed, label='Kaplan Meier Estimate')
km.plot()
def time_at_survival_threshold(kmf, threshold):
	sf = kmf.survival_function_
	return sf[sf[kmf._label] <= threshold].index.min()
thresholds = [0.8, 0.65]
colors = ['blue', 'green']
for thresh, color in zip(thresholds, colors):
	time = time_at_survival_threshold(km, thresh)
	if pd.notna(time):
		plt.axhline(thresh, color=color, linestyle='dashed')
		plt.axvline(time, color=color, linestyle='dashed')
		plt.text(time + 1, thresh + 0.02,
			f"{int(thresh*100)}% survival  {time} months",
			color=color, fontsize=10)
plt.title('Survival Curve using Telco data')
plt.ylabel('Likelihood of Survival')

# Survival Analysis for Payment Method
kmf_ch1 = KaplanMeierFitter()
T1 = df.loc[df['PaymentMethod'] == 'Bank transfer (automatic)', 'tenure']
E1 = df.loc[df['PaymentMethod'] == 'Bank transfer (automatic)', 'Churn']
kmf_ch1.fit(T1, E1, label='Bank transfer')
ax = kmf_ch1.plot(ci_show=False)

kmf_ch2 = KaplanMeierFitter()
T2 = df.loc[df['PaymentMethod'] == 'Credit card (automatic)', 'tenure']
E2 = df.loc[df['PaymentMethod'] == 'Credit card (automatic)', 'Churn']
kmf_ch2.fit(T2, E2, label='Credit card')
ax = kmf_ch2.plot(ci_show=False)

kmf_ch3 = KaplanMeierFitter()
T3 = df.loc[df['PaymentMethod'] == 'Electronic check', 'tenure']
E3 = df.loc[df['PaymentMethod'] == 'Electronic check', 'Churn']
kmf_ch3.fit(T3, E3, label='Electronic check')
ax = kmf_ch3.plot(ci_show=False)

kmf_ch4 = KaplanMeierFitter()
T4 = df.loc[df['PaymentMethod'] == 'Mailed check', 'tenure']
E4 = df.loc[df['PaymentMethod'] == 'Mailed check', 'Churn']
kmf_ch4.fit(T4, E4, label='Mailed check')
ax = kmf_ch4.plot(ci_show=False)

plt.title("Churn Duration based on Payment Method Survival Curve")
plt.ylabel('Likelihood of Survival');

# Survival Analysis for Contract Type
kmf_ch1 = KaplanMeierFitter()
T1 = df.loc[df['Contract'] == 'Month-to-month', 'tenure']
E1 = df.loc[df['Contract'] == 'Month-to-month', 'Churn']
kmf_ch1.fit(T1, E1, label='Month-to-month')
ax = kmf_ch1.plot(ci_show=False)

kmf_ch2 = KaplanMeierFitter()
T2 = df.loc[df['Contract'] == 'One year', 'tenure']
E2 = df.loc[df['Contract'] == 'One year', 'Churn']
kmf_ch2.fit(T2, E2, label='One year')
ax = kmf_ch2.plot(ci_show=False)

kmf_ch3 = KaplanMeierFitter()
T3 = df.loc[df['Contract'] == 'Two year', 'tenure']
E3 = df.loc[df['Contract'] == 'Two year', 'Churn']
kmf_ch3.fit(T3, E3, label='Two year')
ax = kmf_ch3.plot(ci_show=False)

plt.title("Churn Duration based on Contract Term Survival Curve")
plt.ylabel('Likelihood of Survival');
            

Churn Survival by Contract Type


The month-to-month contract has the lowest likelihood of survival: the curve reaches roughly 0% at around 70 months and crosses a 50% probability of survival at around 35 months. The two-year contract has a very high likelihood of survival, since customers who commit to a two-year subscription are likely satisfied with the service and unlikely to churn. The one-year contract drops sharply after the 60-month mark, and its survival duration and probability are roughly half those of the two-year contract.

Churn Survival by Payment Method


This outcome mirrors the bar chart comparing churn by payment method. Bank transfer and credit card have very similar survival curves. Mailed check is slightly lower, possibly because it is a manual payment method, so customers may be quicker to drop the service. Electronic check is also a manual method, but a much higher proportion of those customers have churned, so its survival curve sits well below the others.

Overall Churn Survival

On the overall Kaplan-Meier curve, there is an 80% probability of survival beyond about 22 months and a 65% probability of survival beyond about 65 months.
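
These readings can also be taken from the fitted estimators rather than read off the plot. A small sketch using the overall km fit and the month-to-month fit from the contract-type block above (survival_function_at_times and median_survival_time_ are standard lifelines attributes):

# Survival probability on the overall curve at selected tenures
for t in [22, 48, 65]:
	prob = km.survival_function_at_times(t).iloc[0]
	print(f"Probability of still being a customer after {t} months: {prob:.2f}")

# Median survival time for month-to-month customers
# (kmf_ch1 holds the month-to-month fit after the contract-type block runs)
print(f"Month-to-month median survival: {kmf_ch1.median_survival_time_} months")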

Key Insights

High Risk Customer Segments

  • New customers (0-12 months) with month-to-month contracts
  • Customers paying via electronic check
  • Fiber optic internet users without tech support
  • Customers with high monthly charges ($70-90 range)
  • Customers without online security features

Recommended Retention Strategies

  • Offer contract upgrades with incentives to month-to-month customers
  • Promote automatic payment methods with discounts
  • Bundle tech support with fiber services at reduced rates
  • Introduce loyalty programs for customers in months 7-12
  • Create targeted service bundles for high ARPU customers

Key Churn Factors

  • Contract type is the strongest predictor of churn behavior
  • Tenure shows strong inverse relationship with churn probability
  • Payment method significantly influences retention
  • Technical support services are critical for fiber customers
  • Price sensitivity peaks in specific monthly charge ranges

Correlation between Churn & Other Variables

Python Code for Correlation
# Create Correlation Table
df_copy = df.copy()
columns = df_copy.columns
label_encoder = LabelEncoder()
for col in columns:
	df_copy[col] = label_encoder.fit_transform(df_copy[col])
df_copy
correlation_matrix = df_copy.corr()
churn_correlation = correlation_matrix['Churn'].sort_values(ascending=False)
print(churn_correlation)

# Visualize with a heat map
plt.figure(figsize=(15, 12)) # Adjust figure size
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5, vmin = -1, vmax = 1)
plt.title('Feature Correlation Heatmap')
plt.show()
            

Correlation


Correlation Heatmap


Monthly Charges (0.183523): Of the variables analyzed, Monthly Charges has the strongest positive correlation with churn. Customers may churn because their monthly charges are too high, which can also be tied to bundles that include services the customer does not use.

Senior Citizens (0.150889): Older customers show a higher association with, and probability of, churning. This may be because senior citizens are less tech savvy, need fewer subscriptions, or are more price sensitive and have less need for these telecommunication services.

Contract (-0.396713): Customers on a one-year or two-year contract are much less likely to cancel their subscription, since the contract runs for the full term and cancelling before it ends may involve fees. Customers on longer contracts may also receive discounts and bundles that give them an incentive to stay.

Machine Learning Models

Random Forest Model

Python Code for Random Forest Model
# Define features and target
X = df_copy.drop(columns=['Churn']) 
y = df_copy['Churn']
X = pd.get_dummies(X, drop_first=True)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
            
# Train with Random Forest
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(X_train, y_train)
y_pred_RF = random_forest.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred_RF)
print(f"Accuracy: {accuracy * 100}%")

# Confusion matrix for Random Forest
confusion_matrix_RF = metrics.confusion_matrix(y_test, y_pred_RF)
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(confusion_matrix_RF), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Random Forest Classifier', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

# Predict probabilities and plot the ROC curve
y_probs = random_forest.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, thresholds = roc_curve(y_test, y_probs)
roc_auc_rf = metrics.roc_auc_score(y_test, y_probs)  # renamed to avoid shadowing the imported auc()
plt.plot(fpr_rf, tpr_rf, color='red', label=f"Random Forest (AUC = {roc_auc_rf:.3f})")
plt.plot([0, 1], [0, 1], color='black', linestyle='--', label='Random guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC Curve', fontsize=16)
plt.legend(loc='lower right')
plt.show()
			

Confusion Matrix

  • True Negatives (TN): 946 – Model correctly predicted class 0
  • False Positives (FP): 90 – Model incorrectly predicted 1 when it was actually 0
  • False Negatives (FN): 197 – Model incorrectly predicted 0 when it was actually 1
  • True Positives (TP): 176 – Model correctly predicted class 1

Metrics you can infer:

  • Accuracy = (TP + TN) / Total = (946 + 176) / (946 + 90 + 197 + 176) ≈ 79.6%
  • Precision (for class 1) = TP / (TP + FP) = 176 / (176 + 90) ≈ 66.2%
  • Recall (for class 1) = TP / (TP + FN) = 176 / (176 + 197) ≈ 47.2%
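
These figures can also be computed directly rather than by hand; a short check using the Random Forest predictions and the classification_report already imported above:

# Precision, recall and F1 for both classes
print(classification_report(y_test, y_pred_RF, digits=3))

# The same quantities pulled straight from the confusion matrix
tn, fp, fn, tp = confusion_matrix_RF.ravel()
print(f"Accuracy:  {(tp + tn) / (tp + tn + fp + fn):.3f}")
print(f"Precision: {tp / (tp + fp):.3f}")
print(f"Recall:    {tp / (tp + fn):.3f}")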

ROC Curve


The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate across classification thresholds, showing how well the model separates the positive and negative classes. The red curve sitting well above the diagonal (the black dashed line representing random guessing) indicates that the model performs much better than random. The area under the ROC curve (AUC) appears to be quite good, probably around 0.85–0.9, suggesting strong discriminative ability.

Logistic Regression Model

Python Code for Logistic Regression Model
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression model
logisticRegr = LogisticRegression(max_iter=500)
logisticRegr.fit(X_train_scaled, y_train)

# Model evaluation
accuracy = logisticRegr.score(X_test_scaled, y_test)
y_pred_log = logisticRegr.predict(X_test_scaled)
print(f"Accuracy: {accuracy * 100}%")

# Fit logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary())

confusion_matrix = metrics.confusion_matrix(y_test, y_pred_log)

# Create confusion matrix heat map for Logistic Regression
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# create heatmap
sns.heatmap(pd.DataFrame(confusion_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion Matrix for Logistic Regression', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
			

Confusion Matrix

  • True Negatives (TN): 938 – Model correctly predicted class 0
  • False Positives (FP): 98 – Model incorrectly predicted 1 when it was actually 0
  • False Negatives (FN): 165 – Model incorrectly predicted 0 when it was actually 1
  • True Positives (TP): 208 – Model correctly predicted class 1

Metrics you can infer:

  • Accuracy = (TP + TN) / Total = (938 + 208) / (938 + 98 + 165 + 208) ≈ 81.3%
  • Precision (for class 1) = TP / (TP + FP) = 208 / (208 + 98) ≈ 67.97%
  • Recall (for class 1) = TP / (TP + FN) = 208 / (208 + 165) ≈ 55.76%

Regression Table


The table shows the estimated coefficients, standard errors, and significance levels for each predictor in the logistic regression model. Key factors negatively associated with churn include having a contract, tech support, and online security services. Conversely, features like paperless billing and higher monthly charges show a positive association with churn. Variables such as gender and payment method were not statistically significant in predicting churn.
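
One way to make the regression table easier to read is to convert the coefficients into odds ratios. A sketch using the fitted result object from the code above (the 0.05 cutoff is simply the conventional significance level, not something fixed by the original analysis):

# Exponentiate coefficients to get odds ratios and keep significant predictors
odds_ratios = pd.DataFrame({
	'odds_ratio': np.exp(result.params),
	'p_value': result.pvalues
})
print(odds_ratios[odds_ratios['p_value'] < 0.05].sort_values('odds_ratio'))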

81.3%

Logistic Regression Accuracy

The Random Forest model achieved an overall accuracy of roughly 79.6%, as computed from its confusion matrix. While it performed well at identifying non-churners, it struggled to correctly classify churners, reflected in a relatively low recall for the positive class (around 47.2%). The Logistic Regression model, after refining the feature set and tuning hyperparameters, reached a slightly higher accuracy of about 81.3% and better recall for churners (around 55.8%), while also offering much greater interpretability: its regression table highlights statistically significant factors driving churn, such as contract type, tech support, and online security services. Random Forest remains a strong non-linear baseline, but on this test set the improved Logistic Regression model matches or exceeds it in predictive performance and excels in explainability, making it the stronger choice for stakeholder-facing use or policy decisions.
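
For completeness, a side-by-side check of the two classifiers on the same held-out test set, using the predictions already computed above:

# Compare accuracy and churn-class recall for the two models
from sklearn.metrics import recall_score

for name, preds in [('Random Forest', y_pred_RF), ('Logistic Regression', y_pred_log)]:
	acc = accuracy_score(y_test, preds)
	rec = recall_score(y_test, preds, pos_label=1)
	print(f"{name}: accuracy = {acc:.3f}, churn recall = {rec:.3f}")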